In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Text classification for SMS spam detection

Outline:

  • Feature extraction with a bag-of-words representation
  • Training a binary classifier (spam / not spam)
  • Evaluating the classifier on a held-out test set
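
As a preview, these steps can be chained with scikit-learn's Pipeline. A minimal self-contained sketch on a made-up toy corpus (the real dataset is loaded below):

In [ ]:
# Preview: the outline steps chained into one scikit-learn Pipeline,
# run on a made-up toy corpus (the real SMS dataset is loaded below).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_text = ["win cash now", "free prize entry", "see you at lunch", "meeting at noon"]
toy_y = [False, False, True, True]  # True = ham, False = spam (same convention as below)

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(toy_text, toy_y)
print(pipe.predict(["free cash prize"]))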

In [ ]:
!head "datasets/smsspam/SMSSpamCollection"

In [ ]:
import os

# Each line of the file is "<label>\t<message>"; split into label and text.
with open(os.path.join("datasets", "smsspam", "SMSSpamCollection")) as f:
    lines = [line.strip().split("\t") for line in f.readlines()]
text = [x[1] for x in lines]
y = [x[0] == "ham" for x in lines]  # True = ham (legitimate), False = spam

In [ ]:
text[:10]

In [ ]:
y[:10]

In [ ]:
type(text)

In [ ]:
type(y)

In [ ]:
from sklearn.model_selection import train_test_split

# Fix the random seed so the split is reproducible.
text_train, text_test, y_train, y_test = train_test_split(text, y, random_state=42)

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(text_train)  # learn the vocabulary from the training texts only

X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)
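
To see what the bag-of-words encoding produces, we can transform a single made-up message and inspect its nonzero entries (the column indices depend on the vocabulary learned from text_train):

In [ ]:
# Toy check: the sparse output lists (row, column) -> count pairs
# for the tokens of this message that occur in the vocabulary.
example = vectorizer.transform(["free entry to win cash now"])
print(example)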

In [ ]:
print(len(vectorizer.vocabulary_))

In [ ]:
print(vectorizer.get_feature_names_out()[:20])

In [ ]:
print(vectorizer.get_feature_names_out()[3000:3020])

In [ ]:
print(X_train.shape)
print(X_test.shape)

Training a Classifier on Text Features

We can now train a classifier, for instance a logistic regression, which is a simple and fast baseline for text classification tasks:


In [ ]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf

In [ ]:
clf.fit(X_train, y_train)

We can now evaluate the classifier on the test set. Let's first use the built-in score method, which returns the rate of correct classifications (the accuracy) on the test set:


In [ ]:
clf.score(X_test, y_test)
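
The same number can be computed by hand from the predictions, which makes explicit that score reports plain accuracy here:

In [ ]:
# Sanity check: accuracy is the fraction of test messages predicted correctly.
y_pred = clf.predict(X_test)
print(np.mean(y_pred == np.array(y_test)))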

We can also compute the score on the training set to see how well we fit the training data; a large gap between the training and test scores would indicate overfitting:


In [ ]:
clf.score(X_train, y_train)
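
For comparison, a multinomial naive Bayes classifier, another classic fast baseline for count features, can be trained and scored the same way; a minimal sketch:

In [ ]:
# Alternative baseline: multinomial naive Bayes on the same count features.
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))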

Visualizing important features


In [ ]:
def visualize_coefficients(classifier, feature_names, n_top_features=25):
    # Find the features with the largest positive and negative coefficients.
    coef = classifier.coef_.ravel()
    positive_coefficients = np.argsort(coef)[-n_top_features:]
    negative_coefficients = np.argsort(coef)[:n_top_features]
    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])
    # Plot them: red bars (negative) push towards "spam", blue (positive) towards "ham".
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[interesting_coefficients]]
    plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(2 * n_top_features), feature_names[interesting_coefficients],
               rotation=60, ha="right")

In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())

Many features are tokens that appear only once, often misspellings or one-off artifacts. Requiring each token to occur in at least two training documents (min_df=2) prunes them:

In [ ]:
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(text_train)

X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
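
To quantify the effect of min_df=2, print the pruned vocabulary size and compare it with the size printed earlier:

In [ ]:
# Vocabulary size after pruning tokens seen in fewer than two documents.
print(len(vectorizer.vocabulary_))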

In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())

Exercises

Use TfidfVectorizer instead of CountVectorizer. Are the results better? How are the coefficients different?

Change the parameters min_df and ngram_range of the TfidfVectorizer and CountVectorizer. How does that change the important features?
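
A starting point for the first exercise (a sketch, not a full solution; tune the parameters yourself):

In [ ]:
# Exercise starter: swap in TfidfVectorizer and refit the classifier.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()  # try min_df=..., ngram_range=(1, 2), ...
tfidf.fit(text_train)
X_train_tfidf = tfidf.transform(text_train)
X_test_tfidf = tfidf.transform(text_test)

clf_tfidf = LogisticRegression().fit(X_train_tfidf, y_train)
print(clf_tfidf.score(X_test_tfidf, y_test))
visualize_coefficients(clf_tfidf, tfidf.get_feature_names_out())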


In [ ]: